ABSTRACT

A Monte Carlo simulation study was conducted to 
investigate the effects of sample size, estimation method, and model 
specification on structural equation modeling (SEM) fit indices. 

Based on a balanced 3x2x5 design, a total of 6,000 samples were 
generated from a prespecified population covariance matrix, and eight 
popular SEM fit indices were studied. Two primary conclusions were 
suggested. First, for misspecified models, some fit indices appear to 
be noncomparable in terms of the information they provide about model 
fit; some fit indices also seem to be more sensitive to model 
misspecification. Second, estimation method strongly influenced
almost all the fit indices examined, especially for misspecified 
models. These two issues do not appear to have been well documented 
in the previous literature. Most previous simulation studies
focused on correctly specified models, which may explain why these
dynamics went undetected. It is further suggested that future research
should study not only different models relative to model complexity, 
but also a wider range of model specification conditions, including 
correctly specified models as well as models specified incorrectly to 
varying degrees. (Contains 2 figures, 6 tables, and 26 references.) 
(Author/SLD)

Abstract

A Monte Carlo simulation study was conducted to investigate the 
effects of sample size, estimation method, and model specification 
on SEM fit indices. Based on a balanced 3 x 2 x 5 design, a total of 
6,000 samples were generated from a prespecified population 
covariance matrix, and eight popular SEM fit indices were studied. 
Two primary conclusions were suggested. First, for misspecified 
models, some fit indices appear to be non-comparable in terms of 
the information they provide about model fit; some fit indices also 
seem to be more sensitive to model misspecification. Second,
estimation method strongly influenced almost all the fit indices 
examined, especially for misspecified models. These two issues do 
not appear to have been previously well documented in the 
literature. Most previous simulation studies focused on correctly
specified models, which may explain why these dynamics went
undetected. It is further suggested that future research should not
only study different models vis-à-vis model complexity, but also
study a wider range of model specification conditions, including 
correctly specified models as well as models specified incorrectly 
to varying degrees.

Covariance structure analysis, or structural equation modeling
(SEM), has been heralded as a unified model that joins methods from
econometrics, psychometrics, sociometrics, and multivariate 
statistics (Bentler, 1994). The generality and wide applicability
of structural equation modeling have been amply demonstrated 
(Bentler, 1992; Joreskog & Sorbom, 1989). In recent years, SEM has 
become an increasingly popular statistical tool for researchers in 
psychology, education, and in the social and behavioral sciences in 
general. For researchers in these areas, SEM has become an 
important tool for testing theories with both experimental and
non-experimental data (Bentler & Dudgeon, 1996). But despite its
popularity in a variety of research situations, some thorny issues 
still haunt SEM applications. One such prominent issue is the 
assessment of model fit. 

The assessment of model fit in SEM was initially framed within 
the dichotomous decision process of hypothesis testing: the model 
was either accepted as providing good fit to the data, or the model 
was rejected as fitting the empirical data poorly. The decision to 
accept or reject the hypothesis of fit was based on the probability 
level associated with the χ² value, which assesses the discrepancy
between the original sample covariance matrix and the covariance 
matrix reproduced based on model specifications. 

As is the case with statistical significance testing in 
general (Thompson, 1996), such an assessment of model fit is
confounded with sample size: the power of the test increases with 
increases in the sample size used in the analysis. As a result,
model fit assessment becomes very stringent when sample size is
large: even a minimal discrepancy between the original sample
covariance matrix and the reproduced covariance matrix will be
declared statistically significant, and the model will consequently
be rejected as having poor fit with the empirical data. But when sample size is
small, the statistical test is lenient, and the test may fail to 
detect meaningful differences between the sample covariance matrix 
and the covariance matrix reproduced from the specified model. 
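
The dependence of the test on sample size can be illustrated with a minimal sketch. It assumes the usual form of the test statistic, T = (N - 1) x F, where F is the minimized value of the discrepancy function; the discrepancy value (0.05) and the degrees of freedom (10) are hypothetical numbers chosen only for illustration and are not taken from the present study.

```python
# Illustrative sketch: the same model-data discrepancy is "retained" at small N
# and "rejected" at large N. F0 and DF are hypothetical values.
F0 = 0.05          # hypothetical minimized discrepancy function value
DF = 10            # hypothetical model degrees of freedom
CRIT_05 = 18.307   # chi-square critical value for df = 10 at alpha = .05


def chi_square_statistic(n, f_min):
    """Return the model test statistic T = (N - 1) * F_min."""
    return (n - 1) * f_min


for n in (50, 100, 200, 500, 1000):
    t = chi_square_statistic(n, F0)
    decision = "reject" if t > CRIT_05 else "retain"
    print(f"N={n:5d}  T={t:7.2f}  {decision}")
```

With the discrepancy held constant, the decision changes purely as a function of N, which is the confounding described above.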

Indices for Assessing Model Fit

Due to the generally recognized unsatisfactory nature of the χ²
statistic for model fit assessment (Thompson & Daniel, 1996), a
variety of alternative indices for assessing model fit have been 
developed. Although some indices have been based on different 
theoretical rationales (Maiti & Mukherjee, 1991; Tanaka, 1993), 
many of them are superficially similar from a practical point of 
view. To get a sense about the number and variety of these 
indices, we only need to have a quick look at the output of current 
computer programs for SEM analysis. The SEM procedure under SAS
(SAS Institute, 1990), PROC CALIS, outputs close to two dozen fit
indices. Following the same trend, the new version of the LISREL
program (LISREL Mainframe Version 8.12) has substantially increased
the number and type of fit indices in its output. Clear guidelines
are currently lacking regarding the comparability and relative
performance of these indices under different conditions. This
somewhat chaotic state of affairs leaves many researchers confused 
about which indices to consult or present in their research work.

The main reason for this situation is that different types of
fit indices were developed under different theoretical rationales, 
and there does not seem to exist one fit index which meets all our 
expectations for an ideal fit index, even if we had a complete 
consensus regarding our expectations. Although different opinions 
have been expressed as to what characteristics an ideal fit index 
should possess (Cudeck & Henly, 1991; Tanaka, 1993), an ideal fit
index, as discussed by Gerbing and Anderson (1993), may possess at
least three characteristics: (a) a range between 0 and 1, with
0 indicating complete lack of fit, and 1 indicating perfect fit;
(b) independence of sample size; and (c) known distributional
properties to assist interpretation.

Since SEM fit indices were developed with different 
rationales, they may differ across several dimensions. Tanaka 
(1993) proposed a six-dimension typology for SEM fit indices, and 
attempted to categorize some popular fit indices along these six 
dimensions. This multifaceted nature of fit indices not only makes 
the comparison among fit indices difficult, but also makes it 
nearly impossible to select the "best" index from all those 
available. 

Statistically, most popular fit indices fall into one of 
several types, and they were developed with different motivations 
(Gerbing & Anderson, 1993). The first type of fit index,
covariance matrix reproduction indices, attempts to assess the
degree to which the reproduced covariance matrix based on the 
specified model has accounted for the original sample covariance
matrix. This type of fit index can be conceptualized as the
multivariate counterpart of the coefficient of determination (R²)
as in regression or ANOVA (Tanaka & Huba, 1989). Examples
of this type of fit index are the Goodness-of-Fit Index (GFI) and
the Adjusted Goodness-of-Fit Index (AGFI) (Joreskog & Sorbom,
1989).
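
For reference, the forms of these two indices commonly given for maximum likelihood estimation are sketched below. The notation is an assumption, since the text does not define it: S is the sample covariance matrix, \hat{\Sigma} the model-implied covariance matrix, p the number of observed variables, and df the model degrees of freedom.

```latex
\mathrm{GFI} = 1 - \frac{\operatorname{tr}\!\left[(\hat{\Sigma}^{-1} S - I)^{2}\right]}
                        {\operatorname{tr}\!\left[(\hat{\Sigma}^{-1} S)^{2}\right]},
\qquad
\mathrm{AGFI} = 1 - \frac{p(p+1)}{2\,df}\,\bigl(1 - \mathrm{GFI}\bigr)
```

Both expressions parallel the logic of R² and adjusted R²: GFI is one minus a ratio of residual to total weighted variation, and AGFI rescales that ratio by the degrees of freedom.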

The second type of fit index, comparative model fit
indices, assesses model fit by evaluating the comparative fit of a
given model with that of a more restricted null model. In 
practice, the null model is usually a model which assumes no 
relationship among the indicators of the model. Reservations have 
been expressed about the appropriateness of using such null models 
as comparative baselines (Sobel & Bohrnstedt, 1985). Bentler and
Bonett's normed and non-normed fit indices (NFI and N_NFI) and
Bollen's incremental fit index (DELTA2) all belong to this family.
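
The comparative logic of this family can be made concrete with a minimal sketch based on the standard textbook definitions of these indices. The function name and the numerical values in the example are hypothetical and are not taken from the present study; RHO1 (Bollen's normed index, introduced later in the paper) is included because it is built from the same quantities.

```python
def comparative_fit_indices(chi2_null, df_null, chi2_model, df_model):
    """Compare a target model's chi-square with that of the null model.

    Uses the commonly cited definitions of NFI, N_NFI, RHO1, and DELTA2.
    """
    ratio_null = chi2_null / df_null      # chi-square per df, null model
    ratio_model = chi2_model / df_model   # chi-square per df, target model
    return {
        "NFI": (chi2_null - chi2_model) / chi2_null,
        "N_NFI": (ratio_null - ratio_model) / (ratio_null - 1.0),
        "RHO1": (ratio_null - ratio_model) / ratio_null,
        "DELTA2": (chi2_null - chi2_model) / (chi2_null - df_model),
    }


# Hypothetical chi-square values for a null model and a target model.
example = comparative_fit_indices(chi2_null=1000.0, df_null=45,
                                  chi2_model=60.0, df_model=40)
for name, value in example.items():
    print(f"{name:7s} {value:.4f}")
```

All four indices depend on the null-model chi-square, which is why a poorly fitting null model can inflate them.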

The third type of fit index, parsimony weighted indices,
specifically takes model parsimony into consideration by imposing
penalties for specifying more elaborate models. More specifically, 
these fit indices consider both model fit and the degrees of 
freedom used for specifying the model. If good model fit is 
obtained at the expense of freeing more parameters, a penalty will 
be imposed. The reasoning in this type of model assessment is 
embedded in a long scientific tradition going back to William of
Occam's razor: between two models that fit the data equally well, the
simpler model is more likely to be true, and therefore is also more
likely to be replicated. Besides, statistically, fit can only
improve or stay the same when more parameters in the model are freed. The
parsimony indices proposed by James, Mulaik, and Brett (1982) and by
Mulaik et al. (1989) represent this type.
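
A minimal sketch of the parsimony adjustment follows, assuming the commonly cited form in which a fit index is multiplied by the parsimony ratio (model degrees of freedom divided by null-model degrees of freedom). The function name and all numerical values are hypothetical.

```python
def parsimony_adjusted(fit_value, df_model, df_null):
    """Weight a fit index by the parsimony ratio df_model / df_null."""
    return (df_model / df_null) * fit_value


# Two hypothetical models: B fits slightly better but frees ten more parameters.
nfi_a, df_a = 0.940, 40
nfi_b, df_b = 0.945, 30
df_null = 45

print(f"Model A: {parsimony_adjusted(nfi_a, df_a, df_null):.4f}")
print(f"Model B: {parsimony_adjusted(nfi_b, df_b, df_null):.4f}")
```

In this illustration the small gain in unadjusted fit for Model B does not offset the penalty for its additional free parameters.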

A recent development in model fit assessment makes use of the
noncentrality statistic from the noncentral χ² distribution for
constructing fit indices. Based on the sample noncentrality statistic,
McDonald (1989) proposed an index of noncentrality. Bentler (1990) 
proposed the Comparative Fit Index (CFI) which also uses the sample 
noncentrality statistic. As with other fit indices proposed by 
Bentler, CFI assesses model fit relative to a baseline null model. 
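
The two noncentrality-based indices can be sketched as follows, using their commonly cited definitions. This is an illustration under stated assumptions: the sample noncentrality estimate is taken as the chi-square minus its degrees of freedom, the centrality index is scaled by N - 1 (some presentations use N), and the example values are hypothetical.

```python
import math


def noncentrality(chi2, df):
    """Sample noncentrality estimate, truncated at zero."""
    return max(chi2 - df, 0.0)


def cfi(chi2_null, df_null, chi2_model, df_model):
    """Comparative Fit Index relative to a baseline null model."""
    d_model = noncentrality(chi2_model, df_model)
    d_null = noncentrality(chi2_null, df_null)
    denominator = max(d_null, d_model)
    return 1.0 if denominator == 0.0 else 1.0 - d_model / denominator


def mcdonald_centrality(chi2_model, df_model, n):
    """McDonald's centrality index; no null model is involved."""
    return math.exp(-0.5 * (chi2_model - df_model) / (n - 1))


print(f"CFI        {cfi(1000.0, 45, 60.0, 40):.4f}")
print(f"Centrality {mcdonald_centrality(60.0, 40, 200):.4f}")
```

Note that the centrality index uses only the target model's chi-square, whereas CFI, like the other Bentler indices, also depends on the null model.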

Some Considerations for Assessing Fit Indices

As discussed before, one major problem caused by the variety 
of SEM fit indices is that they create confusion in research 
practice. Not only are the rationales for different indices 
unclear to many researchers, but clear guidelines are also lacking 
as regards choosing among these indices. Furthermore, most fit 
indices have unknown distributional properties, thus making 
interpretation of sample fit indices very difficult. 

The obvious reason for the lack of clear guidelines for 
choosing among different indices is that we simply do not fully 
understand the performance characteristics of these indices under 
different conditions. Due to the multifaceted nature of fit 
indices, and to different rationales for developing these indices, 
there does not seem to be a straightforward criterion against which 
the performance of all fit indices can be judged. Although it is 
not realistic to expect one straightforward criterion for judging
the performance of fit indices, several related criteria can be
considered for this purpose. 

First, despite the arguments in support of the role sample 
size plays in statistical decisions (Cudeck & Henly, 1991), the
fact that the development of many fit indices was motivated by the need to
overcome the shortcomings of the χ² statistic, especially its
sensitivity to sample size, cannot be ignored. For this reason,
ideally, fit indices should be insensitive to or independent of
sample size (Bollen, 1986). This means that an index's variation
contributed by sample size conditions should be as small as 
possible. 

Second, under the assumption of multivariate normality, model 
fitting and estimating can be accomplished through different 
statistical procedures, such as maximum likelihood (ML) or 
generalized least squares (GLS). Ideally, fit indices should be
invariant over this condition, i.e., different statistical theories 
should not result in excessively variable indices for the same 
data. This reasoning leads to the expectation that, ideally, 
estimation procedures should contribute relatively little to an 
index's variation. 
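
For reference, the two discrepancy functions in their standard textbook forms are sketched below. The notation is an assumption, since the text does not define it: S is the sample covariance matrix, \Sigma(\theta) the model-implied covariance matrix, p the number of observed variables, and N the sample size.

```latex
F_{\mathrm{ML}} = \ln\lvert\Sigma(\theta)\rvert
                + \operatorname{tr}\!\left[S\,\Sigma(\theta)^{-1}\right]
                - \ln\lvert S\rvert - p,
\qquad
F_{\mathrm{GLS}} = \tfrac{1}{2}\operatorname{tr}\!\left\{\left[\bigl(S - \Sigma(\theta)\bigr)S^{-1}\right]^{2}\right\},
\qquad
T = (N-1)\,F_{\min}
```

Both functions equal zero when \Sigma(\theta) = S, but they weight residuals differently (by the model-implied matrix under ML and by the sample matrix under GLS), so their minimized values need not agree when the model does not fit exactly.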

Third, fit indices are designed to provide information about 
the degree to which a model is correctly or incorrectly specified 
for the given data. Obviously, model misspecification should 
directly affect fit indices. Put differently, the degree of model 
misspecification should be the major contributor to the variation 
of a sample fit index.

Finally, as in any other statistical estimation, two criteria
apply in assessing the relative performance of competing 
estimators: unbiasedness and variation. Between two estimators, 
the one less biased is most often preferred; between two equally 
unbiased estimators, the one with less random variation is most 
often preferred. This consideration leads to two additional 
expectations: (a) a good fit index should have as little systematic 
bias (upward or downward) as possible; and (b) an ideal fit index 
should have as little random variation as possible. 
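
These two expectations can be operationalized in a minimal sketch: bias is estimated as the mean of the sample fit index values minus the population value, and random variation as their standard deviation. The function name and the data are hypothetical and are not results from the present study.

```python
import statistics


def bias_and_spread(sample_values, population_value):
    """Return (estimated bias, standard deviation) of a sample fit index."""
    bias = statistics.mean(sample_values) - population_value
    spread = statistics.stdev(sample_values)
    return bias, spread


# Hypothetical values of one fit index across five replications.
replications = [0.91, 0.93, 0.95, 0.92, 0.94]
bias, spread = bias_and_spread(replications, population_value=0.97)
print(f"bias = {bias:+.4f}, SD = {spread:.4f}")
```

A negative bias indicates that the sample index tends to understate the population value.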

Given these five criteria, the relative performance of fit indices
can be assessed through Monte Carlo experiments. Monte Carlo
simulation becomes necessary mainly due to the lack of theory with
which to specify the distributions for the indices. As pointed out by
Bentler (1990), "Essentially nothing is known about the theoretical
sampling distribution of the various estimators" (p. 245).

Previous Studies

Researchers have carried out simulation studies for most SEM 
model fit indices. Although some early studies focused on the behavior
of χ² under different sample size conditions (Boomsma, 1982),
it soon became apparent that the χ² test was too dependent on sample
size to be useful in many situations. As a result, many 
alternative model fit indices were developed, and the majority of 
later simulation studies put more emphasis on these alternative 
model fit indices, especially those ranging from 0 to 1. 

Invariably, simulation studies investigated the behavior of
model fit indices under different sample size conditions (Anderson
& Gerbing, 1984; Bearden, Sharma, & Teel, 1982; Bentler, 1990;
Bollen, 1989; La Du & Tanaka, 1989; Marsh, Balla, & McDonald,
1988), since sensitivity to sample size has been considered a major
weakness of the original χ² approach, and consequently, a major concern regarding
alternative model fit indices. The majority of fit indices
investigated, including the normed fit index (NFI), the
goodness-of-fit index (GFI), and the adjusted goodness-of-fit index (AGFI),
were shown to be influenced by sample size to different degrees.

But since different indices were involved in different 
studies, a performance comparison of the indices across different 
simulation designs tends to be difficult. Also, for obvious 
reasons, most studies looked at early fit indices, such as GFI, 
AGFI, NFI, etc., and some newer indices, such as McDonald's
centrality index, Bollen's DELTA2, etc., have rarely been investigated.

The sensitivity of some fit indices to model misspecification
was examined in a few studies (Bentler, 1990; La Du & Tanaka, 1989; 
Marsh et al., 1988). The study by Marsh et al. (1988) was 
comprehensive in terms of the variety of fit indices studied, but 
the small number of replications in each cell condition (n=10) 
might have limited the generalizability of some conclusions. One 
finding from the study was that the relative fit indices, such as 
NFI, tended to be non-comparable across different studies or 
different data sets, since their values not only depended on model 
specification, but also, or more importantly, on how bad the null 
model itself was. Other studies (Bentler, 1990; La Du & Tanaka,
1989) involved fewer indices, making performance comparison among
fit indices difficult.

Very little is known about the influence of estimation methods 
on fit indices. In a few studies which examined the issue (La Du 
& Tanaka, 1989; Maiti & Mukherjee, 1991), maximum likelihood (ML)
and generalized least squares (GLS) estimation procedures were 
used. Estimation procedures were shown to influence the value of 
the fit indices studied. But in these studies, only very few fit
indices were examined, and the performance of other popular indices
was unknown.

The purpose of the present study was to compare empirically 
the relative performance of SEM model fit indices. Three prominent 
factors which might affect SEM indices were considered: sample 
size, estimation procedure, and model misspecification. A
three-factor experimental design was used to compare results across the
Monte Carlo simulations. The variation of each fit index was 
partitioned to assess the influence of the three factors, and an 
index's behavior pattern was empirically examined. 

METHOD 

SEM Fit Indices Studied 

Although a variety of fit indices exist, some of them are not 
readily comparable with each other. For example, Akaike's
information criterion (AIC) has a metric so different from that of
many other fit indices, and is used in such a different fashion, that
a meaningful comparison between AIC and GFI is difficult. Based on
the consideration of comparability, eight popular indices were 
chosen for the study: goodness-of-fit index (GFI), adjusted
goodness-of-fit index (AGFI), Bentler's comparative fit index
(CFI), McDonald's centrality index (Centrality), non-normed fit
index (N_NFI), normed fit index (NFI), Bollen's normed fit index
rho1 (RHO1), and Bollen's non-normed index delta2 (DELTA2). All
these fit indices have an approximate range from zero to one, with 
higher values indicating better fit, and lower values indicating 
poorer fit. The comparable scale of these indices makes the 
comparison among them more straightforward. 

Design of Monte Carlo Simulations 

A three-factor balanced experimental design was used. The 
design is graphically represented in Figure 1. Five levels of the 
sample size condition (50, 100, 200, 500, 1000), three levels of
model specification (true model, slightly misspecified model,
moderately misspecified model), and two estimation methods (maximum
likelihood, generalized least squares) were incorporated in the
5 x 3 x 2 design. Under this design, a total of 6,000 (5 x 3 x 2 x 200)
replications of SEM model fitting were conducted, with 200
replications in each cell condition. Such a design allowed a
systematic assessment of the impact of the three factors on fit
indices: sample size, degree of model misspecification, and estimation
procedure. Also, 200 replications within each of the conditions
provided estimates precise enough to allow systematic comparisons
among the fit indices on characteristics such as unbiasedness and 
degree of random variation. 
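
The structure of the design can be enumerated in a short sketch. The factor levels are those stated above; the variable names are illustrative only.

```python
from itertools import product

SAMPLE_SIZES = (50, 100, 200, 500, 1000)
MODELS = ("true", "slightly misspecified", "moderately misspecified")
METHODS = ("ML", "GLS")
REPLICATIONS = 200

# Each cell is one (sample size, model, estimation method) combination.
cells = list(product(SAMPLE_SIZES, MODELS, METHODS))

# Each run is one replication of SEM model fitting within a cell.
runs = [(n, model, method, rep)
        for n, model, method in cells
        for rep in range(REPLICATIONS)]

print(len(cells), "cells;", len(runs), "model-fitting runs")
```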



Insert Figure 1 about here

Model and Model Specification

An SEM model of moderate complexity was simulated in the 
present study, as presented in Figure 2. This model was derived 
from a substantive research example described in the LISREL 
(Version 7) manual (Joreskog & Sorbom, 1989, p. 178). As suggested 
elsewhere (Gerbing & Anderson, 1993), simulating substantively
meaningful models in Monte Carlo simulation may increase the 
external validity of Monte Carlo research results. 

Insert Figure 2 about here 

The degree of SEM model complexity is a characteristic which 
is difficult to define, since complexity depends not only on the 
number of observed variables, but also on the number of latent 
variables, as well as on the unique relationship pattern among both 
given observed and latent variables. Most substantive studies 
using SEM involved from two to six latent variables, with about two 
to six indicators for each latent variable (Gerbing & Anderson, 
1993). If this observation is correct, the model simulated in the
present study, with four latent variables (two exogenous and two
endogenous latent variables), each of which has three or four
indicators, could be characterized as having moderate complexity, 
though of course such characterization is inherently subjective. 

The population parameters for the true model in Figure 2 were
artificially specified, as presented in Table 1 using LISREL
notation. The population covariance matrix for the true
model was obtained from the prespecified parameters in Table
1, using the formula
(Joreskog & Sorbom, 1989, p. 5) : 
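
For reference, the standard LISREL expression for the model-implied covariance matrix of the observed variables is sketched below; it is presumably the formula cited, although that is an assumption. The symbols follow conventional LISREL notation (\Lambda_y, \Lambda_x: factor loadings; B, \Gamma: structural coefficients; \Phi, \Psi: covariance matrices of the exogenous latent variables and the structural disturbances; \Theta_\varepsilon, \Theta_\delta: measurement error covariance matrices).

```latex
\Sigma =
\begin{bmatrix}
\Lambda_y (I - B)^{-1} (\Gamma \Phi \Gamma' + \Psi) (I - B')^{-1} \Lambda_y' + \Theta_\varepsilon
  & \Lambda_y (I - B)^{-1} \Gamma \Phi \Lambda_x' \\[4pt]
\Lambda_x \Phi \Gamma' (I - B')^{-1} \Lambda_y'
  & \Lambda_x \Phi \Lambda_x' + \Theta_\delta
\end{bmatrix}
```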



Table 2 presents the population covariance matrix reproduced 
using SEM population parameters in Table 1 and the formula above. 
Mathematically it is guaranteed that, other than rounding errors,
perfect fit would be obtained if the model in Figure 2 were fit to
this population covariance matrix. Since variable means do not
affect SEM model fitting, to simplify the data generation process,
all variables were centered to have means of zero. All sample
data sets were generated based on the population covariance matrix 
in Table 2. 



Although a true model is relatively easy to specify in
simulation research, model misspecification is difficult to handle
for at least two reasons: (a) model misspecification can take a wide
variety of forms; and (b) the degree of model misspecification is
not easily quantified, so it is difficult to make a priori
predictions about the severity of misspecification (Gerbing &
Anderson, 1993). We do not yet have solutions to these issues. In
the present study, model misspecification was achieved by fixing
some parameters in the measurement model which should be set free,
i.e., by setting some parameter values to zero when, in fact,
they were not, as indicated in Figure 2.

The degree of model misspecification was empirically
determined by fitting the two misspecified models to the population
covariance matrix, and the resultant values of the fit indices
were used as indicators of the severity of model misfit. The terms
"slightly misspecified" and "moderately misspecified" are used in
the present paper simply to indicate different degrees of
misspecification in this study; by no means should these terms be
generalized beyond the present study, unless the degrees of
misspecification are operationalized in the same manner.

Data Source 

The present study only considered sample data generated from 
multivariate normal distributions. As a result, any issues related 
to data non-normality were not investigated. Data generation was
accomplished using the data generator under the Statistical
Analysis System (SAS PC Window Version 6.08). For each of the
6,000 sample data sets generated, the following steps were
implemented:

(1) random normal variables with a desired sample size were
generated, using the pseudorandom number generator under
SAS;

(2) the random normal variables were linearly transformed to
have desired means and standard deviations;

(3) the uncorrelated variables were then transformed to
multivariate sample data with pre-specified population
inter-variable correlations, using the matrix
decomposition procedure (Kaiser & Dickman, 1962; Vale &
Maurelli, 1983).

(4) The multivariate sample data was fit to one of the three 
models under one of the two estimation procedures, using 
PROC CALIS procedure under SAS. All desired fit indices 
from the sample were obtained and saved for later 
analysis. 
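
Steps (1) through (3) can be sketched in a few lines. This is not the SAS code used in the study: the sketch substitutes a Cholesky factorization for the matrix decomposition procedure cited above, uses a hypothetical 2 x 2 population covariance matrix rather than Table 2, and all function names are illustrative.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L such that L times its transpose equals the matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def generate_sample(cov, n, rng):
    """Draw n zero-mean multivariate normal observations with covariance cov."""
    lower = cholesky(cov)
    p = len(cov)
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]   # uncorrelated normals
        data.append([sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(p)])              # impose the covariances
    return data


def sample_covariance(data):
    """Unbiased sample covariance matrix of the generated data."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]


cov = [[1.0, 0.5], [0.5, 2.0]]   # hypothetical population covariance matrix
sample = generate_sample(cov, 1000, random.Random(12345))
print(sample_covariance(sample))
```

The sample covariance matrix of the generated data approximates the population matrix, with the approximation improving as the sample size increases. Step (4), the model fitting itself, is not sketched here.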

Simulation programming was implemented through a combination 
of SAS Macro language, SAS PROC IML matrix language, and the SAS 
PROC CALIS procedure which implements structural equation modeling 
under the SAS environment. All simulations were carried out on an
IBM PC Pentium 100 MHz computer with SAS Windows Version 6.08.

Analysis

The major analytic strategy was to partition variation of 
sample fit indices into different components to assess the 
influence of different factors considered in the design. Since the 
design was a balanced experimental design, the variations due to 
different sources were orthogonal, which made the analysis and 
interpretation more straightforward. Factorial analysis of 
variance (ANOVA) was used as the analysis technique. This analysis 
allows us to partition the variation of a particular fit index into 
four major independent sources: sample size, estimation procedures, 
model misspecification, and random variation, plus some interaction
terms. Using the five criteria discussed previously, the behaviors
of the eight fit indices were systematically examined, and their 
relative performance judged. 
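
The partitioning logic can be sketched for a single main effect: the proportion of an index's total sum of squares attributable to one factor (eta squared) is the between-level sum of squares for that factor divided by the total sum of squares. The function name and the toy data below are hypothetical and are not results from the present study.

```python
def eta_squared(records, factor_index):
    """Proportion of total variation due to one factor's main effect.

    records: list of (levels, value) pairs, where levels is a tuple of
    factor levels and value is the fit index obtained in that replication.
    """
    values = [value for _, value in records]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    groups = {}
    for levels, value in records:
        groups.setdefault(levels[factor_index], []).append(value)
    ss_factor = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                    for g in groups.values())
    return ss_factor / ss_total


# Toy balanced 2 x 2 layout (estimation method x model), two replications each.
records = [
    (("ML", "true"), 0.95), (("ML", "true"), 0.97),
    (("ML", "misspecified"), 0.80), (("ML", "misspecified"), 0.82),
    (("GLS", "true"), 0.96), (("GLS", "true"), 0.98),
    (("GLS", "misspecified"), 0.93), (("GLS", "misspecified"), 0.95),
]
print(f"method: {eta_squared(records, 0):.3f}")
print(f"model:  {eta_squared(records, 1):.3f}")
```

Because the layout is balanced, the main-effect, interaction, and within-cell sums of squares add up to the total, so the resulting proportions can be compared directly across factors.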

Besides partitioning sampling variance of the fit indices to 
assess the influences of different factors, values of fit indices 
were examined to assess characteristics such as the existence, or 
lack thereof, of systematic bias, and the extent of random 
variations for different indices. 

RESULTS AND DISCUSSION 

As discussed previously, five criteria were suggested for 
judging the relative performance of the eight fit indices examined 
in the present study: (a) sensitivity to sample size, (b) 
sensitivity to estimation methods, (c) sensitivity to model 
specification, (d) degree of unbiasedness, and (e) degree of
random variation. Since there does not seem to be any consensus in 
the literature regarding the relative importance of these five 
features, the order of discussion of these issues should not be 
interpreted as reflecting implied relative importance of the 
criteria. 

Table 3 presents descriptive data for the eight fit indices 
under different conditions: model specification, sample size
conditions, and estimation methods. Although more detailed data were
available for the sample fit indices, e.g., confidence intervals,
distribution characteristics (skewness, kurtosis), range, etc.,
here we present only the basic information of means and standard
deviations.

Insert Table 3 about here



Estimation Theory 

An examination of Table 3 reveals several phenomena. First, 
under the true model (Model 1), the population values of the fit
indices were essentially the same based on the two methods: maximum
likelihood (ML) and generalized least squares (GLS). Also, under
the true model, the sample means of the fit indices under different 
sample sizes (50, 100, 200, 500, 1000) were roughly comparable, 
especially with the increase of sample size. Three indices 
(CENTRA, NFI, RHO1) seemed to be exceptions. For these three
indices, the two estimation methods exhibited noteworthy
differences, especially under small sample sizes. For example,
under a sample size of 50, the mean for RHO1 under ML was .9006, while
the mean under GLS was .9948. Similar differences occurred for
CENTRA and NFI. With increased sample size, the difference in mean
values between the two estimation methods seems to disappear. So
under the true model, both the population values and the sample
means of the fit indices gave the impression that the two
estimation methods in SEM provide comparable information about
model fit, especially when sample size is reasonably large (e.g.,
over 200).

However, under the two misspecified models (Model 2: slightly
misspecified; Model 3: moderately misspecified), we observed some
discrepancies between the two estimation methods both in terms of 
the population values of fit indices, and in terms of their sample
means under different sample size conditions. Under Model 2, such
discrepancies did not appear to be too large, except for sample 
means of some individual fit indices (e.g., CENTRA, NFI, RHO1)
under smaller sample size conditions (50, 100). Wherever such 
discrepancies occurred, fit index values based on GLS invariably 
exceeded those based on ML. 

Under Model 3 (the moderately misspecified model), some
discrepancies between the two estimation methods became alarmingly 
large, to the extent that they would give very different 
impressions about model fit. For example, for GFI, population 
values based on ML and GLS were .7902 and .9473, respectively; the 
population values for AGFI based on the two methods were .6791 and 
.9195, respectively. By current standards of model fit, the former 
values in both pairs would be judged as indicating very poor fit, 
while the latter values would be construed as indicating reasonable 
fit. 

The same pattern occurred to varying degrees for the eight fit 
indices, and especially for the GFI, AGFI and CENTRA indices. 
Again, wherever discrepancies occurred, fit values based on GLS 
exceeded those based on ML, and in some cases, to considerable 
degrees. Such large discrepancies between the two estimation
methods were not anticipated. Thus, the two estimation methods seem
to provide somewhat dissimilar information about model fit in the 
presence of model misspecification. 

Model Specification 

Besides the comparison between the two different estimation
methods under different conditions, several other phenomena also
stand out. One such phenomenon was the discrepancy in index 
performance across model specification conditions. Although 
different fit indices seemed to provide similar information about 
model fit under the true model, such was not the case for the two 
misspecified models. For example, for the slightly misspecified 
model, McDonald's centrality had a population value of .8714, while 
CFI and DELTA2 had values as high as .9798. 

This situation became worse under the moderately misspecified 
model: GFI, AGFI, and CENTRA had population values of .7902, .6791, 
and .6087, respectively, while CFI, NFI, and DELTA2 had values of 
.9272, .9269, and .9272, respectively, under the same method (ML). 
Using conventional criteria for judging model fit, these two sets 
of fit indices would suggest very different conclusions about model 
fit, with the former group suggesting poor or very poor fit, and 
the latter group suggesting reasonably good fit. The difference 
across fit indices occurred in similar degrees for the sample means 
of fit indices under different sample size conditions, as well as 
under different estimation methods, i.e., under both ML and GLS. 

These results suggest that the fit indices were differentially 
sensitive to model misspecification. As indicated by data in Table 
3, GFI, AGFI, and CENTRA were more sensitive to model 
misspecification than the other five indices, all of which are 
relative fit indices, i.e., they are constructed by comparing the 
fit of the specified model with that of a null model. 

Sampling Bias

Another observation based on data in Table 3 is related to
systematic sampling bias of fit indices. It can be seen that most 
sample fit indices tended to be systematically biased downward,
though to different degrees. For example, under the true model 
(Model 1), under sample sizes of 50, 100 and 200, GFI and AGFI 
showed fairly strong downward bias under both estimation methods, 
with sample means considerably lower than population values. The 
same was also true under the two misspecified models. Other fit 
indices exhibited a similar downward bias pattern to lesser degrees.
Of the eight fit indices, DELTA2, N_NFI, and CFI showed relatively 
slight downward sampling bias. 

Downward sampling bias under the true model was expected, due 
to the ceiling effect of fit indices. But the degree of downward bias 
of a few indices under misspecified models somewhat exceeded our 
expectations. Also, stronger downward bias seemed to occur for 
those indices which showed more sensitivity to model 
misspecification. More specifically, GFI, AGFI, and CENTRA showed 
more severe downward bias than the other fit indices under ML 
estimation. Furthermore, other than for GFI and AGFI, downward bias 
seems to have occurred only under ML estimation, not under GLS 
estimation. 

This absence of downward bias when using the GLS estimation 
method is probably related to the fact that GLS estimation tended 
to provide almost maximum fit index values even under Model 3 
(moderately misspecified model). As a result, very little sampling 
variation occurred. A comparison of standard deviations between ML 
and GLS estimation methods for the five indices (CFI, N_NFI, NFI, 
RHO1, and DELTA2) indicates substantially smaller standard 
deviations under GLS than those under ML, as reported in Table 3. 

Sources of Variation in the Fit Statistics 

Table 4 presents an ANOVA partitioning of the sampling 
variance of fit indices into different sources. The row labelled 
total sum of squares (SOS) provides an indication of the sampling 
variations of different indices. As indicated by the total sums of 
squares presented, the variation of fit indices differed 
substantially under the simulation conditions represented in the 
study: while the total SOS across all conditions for CFI was 5.118, 
the SOS for CENTRA was 131.9796, more than 25 times as large. Three 
indices (GFI, AGFI, and CENTRA), which were shown earlier to be more 
sensitive to model misspecification than the other five indices, 
seem to have substantially larger variation than the other five. 
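
As an illustration of this kind of partitioning, the following Python sketch computes the proportion of the total sum of squares associated with each main effect (eta-squared) for one fit index. It assumes a balanced design, in which the main-effect sums of squares can be obtained directly from level means. The function and argument names are hypothetical and do not come from the study's software.

```python
import numpy as np


def eta_squared_main_effects(values, factors):
    """Eta-squared for each main effect in a balanced factorial design.

    values  : 1-D sequence of fit index values, one per simulated sample
    factors : dict mapping a factor name to a 1-D sequence of level labels,
              aligned with `values`
    """
    y = np.asarray(values, dtype=float)
    grand_mean = y.mean()
    ss_total = ((y - grand_mean) ** 2).sum()

    result = {}
    for name, labels in factors.items():
        labels = np.asarray(labels)
        ss_effect = 0.0
        for level in np.unique(labels):
            cell = y[labels == level]
            # Between-level sum of squares: n_level * (level mean - grand mean)^2
            ss_effect += cell.size * (cell.mean() - grand_mean) ** 2
        result[name] = ss_effect / ss_total
    return result


# Tiny hypothetical data set: the index varies with model but not with method.
demo = eta_squared_main_effects(
    [1.0, 1.0, 3.0, 3.0],
    {"model": [0, 0, 1, 1], "method": [0, 1, 0, 1]},
)
```

Interaction terms, such as the specification-by-estimation interaction, would be obtained in the same spirit from cell means after removing the main effects. The sketch omits them for brevity.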

Insert Table 4 about here 

Model Misspecification. The η²s for model specification 
reported in Table 4, that is, the proportions of variance associated 
with model specification, indicate that CENTRA had the highest 
proportion of its variation (50.3%) contributed by model 
misspecification, with GFI (37.3%) and AGFI (34.1%) following, in 
that order. As reasoned before, since a fit index is designed to 
provide information about model fit, model specification (including 
model misspecification) should be a major contributor to an index's 
total variation. Also, large variation due to model specification 
indicates an index's sensitivity to model misspecification. NFI, 
RHO1, and DELTA2 seemed to be least sensitive to model 
misspecification, as indicated by their small η²s (15.1%, 16.3%, and 
13.8%, respectively) for the condition of model specification. Based 
on this criterion, CENTRA would be ranked at the top, followed by 
GFI and AGFI. NFI, RHO1, and DELTA2 would be ranked at the bottom. 

Sample Size. Sample size condition strongly influenced GFI 
and AGFI, accounting for 31.5% and 34.3% of total variance, 
respectively, for these two indices. CENTRA was the index least 
susceptible to sample size condition, with only .06% of variance 
accounted for by this condition. CFI and N_NFI also had very small 
percentages of total variance accounted for by sample size 
condition (.6% and .5%, respectively). The other three fit indices 
had about 10% of total variance due to sample size. In short, the 
CENTRA index was least influenced by sample size, followed by CFI 
and N_NFI, while GFI and AGFI seemed to be overly affected by 
sample size. 

Estimation Method. Although the GFI and AGFI indices seemed 
to be overly influenced by sample size, they were least 
influenced by estimation method, with about 10% of their total 
variation contributed by this factor. CENTRA followed GFI and AGFI 
in this regard, with about 20% of total variation accounted for by 
estimation method. The other five indices seemed to be overly 
affected by estimation method, with the percentage of total 
variation contributed by this factor ranging from 32.8% to 46.8%. 
Based on the criterion that a fit index should not be overly 
affected by estimation method, GFI and AGFI would be ranked best, 
with CENTRA following these two. Again, NFI, RHO1, and DELTA2 
would be worst. 

Specification-by-Estimation Interaction. As reported in Table 
4, the interaction term between model specification (MS) and 
estimation method (ES) accounted for a moderately large proportion 
of total variance for all the fit indices, ranging from 12.2% to 
26.1%. This indicates that model specification may have a stronger 
influence on fit indices under one estimation method than under 
another. 

A close look at Table 3 suggests that this interpretation is 
probably correct: model specification had much stronger influences 
on the estimated fit indices under ML than under GLS. As a matter 
of fact, for five indices (CFI, N_NFI, NFI, RHO1, and DELTA2), 
model misspecification seemed to have no impact at all on the 
estimated fit indices under GLS, with all these five indices 
attaining almost maximum values even under Model 3 (moderately 
misspecified model). 

In other words, under GLS estimation, these fit indices are 
almost totally insensitive to the model misspecification conditions 
implemented in the present study; and their values gave the 
impression that even the moderately misspecified model was a model 
with perfect fit to the data. These findings were unexpected, and 
they raise serious questions about the effectiveness of these fit 
indices in providing information about SEM model fit, especially 
under GLS estimation. 

Random Variation 

Table 5 presents data on fit indices' random variation. 
Random variation was assessed through the coefficient of variation 
(CV), which is considered a scale-free measure of variation. Using 
CV to represent random variation has the advantage of avoiding the 
problem of noncomparability across variables caused by different 
measurement metrics. CV is constructed as the ratio of the sample 
standard deviation (s) to the sample mean (X̄) in percentage terms, 
with higher values representing more variation, regardless of 
measurement metric. 
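
The CV computation described above can be sketched in a few lines of Python. The function name is hypothetical, and the use of the n - 1 denominator for the sample standard deviation is an assumption, since the text does not say which denominator was used.

```python
import numpy as np


def coefficient_of_variation(values):
    """CV in percent: 100 * sample standard deviation / sample mean."""
    x = np.asarray(values, dtype=float)
    # ddof=1 gives the n - 1 (sample) standard deviation; assumed here.
    return 100.0 * x.std(ddof=1) / x.mean()


# Hypothetical fit index values from three simulated samples.
cv_demo = coefficient_of_variation([0.9, 1.0, 1.1])
```

Because the standard deviation is divided by the mean, an index whose values sit near zero would have an inflated CV. The fit indices examined here mostly take values well above zero, so this is unlikely to matter for the comparison.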

The results presented in Table 5 support several observations. 
First, larger sample size resulted in smaller variation for all 
indices. This was expected, since sampling variation should 
decrease with the increase of sample size. Second, some indices 
tended to have considerably larger random variation than others. 
Leading the list of fit indices in this regard was CENTRA, with 
consistently higher CVs than other indices under almost all 
conditions (models, sample sizes, estimation methods). Third, 
more severe model misspecification resulted in larger random 
variation of fit indices. Consistently, fit indices had larger 
random variation under the moderately misspecified model than 
under the slightly misspecified model, which, in turn, had larger 
variation than the true model. 

Estimation method also seems to have a strong effect on fit 
indices' random variation. Invariably, random variation was 
substantially larger under ML estimation than under GLS estimation. 
A few indices (e.g., CFI, N_NFI, NFI, and DELTA2) had almost no 
random variation at all under GLS estimation. Although small random 
variation is generally considered a positive aspect of a 
statistic, we suspect that the highly restricted random variation 
under GLS, especially for CFI, N_NFI, NFI, RHO1, and DELTA2, was 
caused by a ceiling effect of these fit indices estimated under 
GLS. If we look back at Table 3, it can be seen that these five 
indices under GLS estimation always attained almost maximum values 
under the various sample size and model specification conditions. 

Conclusions 

These results raise two important issues in SEM analysis. In 
most SEM applications, the major purpose is theory testing. This 
purpose is realized by examining how the predicted relationship 
pattern based on a theory can be supported by empirical data. In 
other words, the fit between a theoretical model and empirical data 
is of paramount importance in SEM analysis. Unfortunately, model 
fit as a central question in SEM analysis appears to be difficult 
to address, to say the least. 

The first major issue raised by the results of the present 
study concerns the comparability of fit indices. The majority of 
previous Monte Carlo studies focused on correctly specified models, 
and much less empirical work has examined models misspecified to 
varying degrees. For a correctly specified model, fit indices seem 
to be comparable in that they all indicate that model fit is close 
to being perfect under reasonable sample size conditions. But for 
misspecified models, the picture is different. 

As indicated by the results of the present study, fit indices 
may be much less comparable to each other than most researchers 
realize. For example, using ML estimation, for our Model 3 
(moderately misspecified model), under the sample size condition of 
500 (a reasonable sample size), a mean value of .7821 for GFI, as 
reported in Table 3, would certainly convey a very different meaning 
about model fit from that based on a mean value of .9200 for NFI, 
or .9269 for DELTA2. Such discrepancies among fit indices have not 
been previously documented. 

It is our belief that too much previous simulation research 
has focused on the true model, and not enough empirical work has 
been done on misspecified models. As a result, this comparability 
issue among fit indices has previously been largely ignored. Based 
on the results obtained in the present study, at least for the 
model conditions examined in the study, some fit indices appear to 
be much more sensitive to model misspecification (e.g., McDonald's 
Centrality, GFI, AGFI) than others (e.g., CFI, N_NFI, NFI, RHO1, 
DELTA2). 

The second major issue involves intra-index comparability 
under different estimation methods. Theoretically, under 
multivariate normality conditions, ML and GLS estimation are 
asymptotically equivalent under large sample conditions (Gerbing & 
Anderson, 1993). If this is the case, empirically, we would expect 
that the discrepancy between fit indices' values under the two 
estimation methods would diminish as sample size increases. This 
expectation, however, did not materialize. For example, for our 
Model 3 (moderately misspecified model), even under the sample size 
condition of 1000, the mean values for GFI were .7850 and .9400, 
respectively, under ML and GLS estimation, as reported in Table 3. 
In research practice, such different fit index values based on the 
same model could lead to very different conclusions about model 
fit. 

Similar intra-index discrepancy existed for other fit indices 
examined in the study. Here again, the discrepancy between 
estimation methods does not seem to be that obvious for models with 
less severe misspecification. Therefore, it is likely that this 
issue has been largely ignored in the literature due to the fact 
that previous focus has been on correctly specified models, and not 
enough work has been done for misspecified models. 

We asked, which fit index has relatively better performance 
under different conditions? Although the results of the present 
empirical study do not provide the final answer to this question, 
some tentative conclusions can be presented. 

To the extent that a fit index should be sensitive to model 
misspecification, the McDonald centrality index performed best, 
followed by GFI and AGFI, with others trailing behind these three. 
If we desire an index which is minimally influenced by sample size, 
then the centrality index again came out to be the choice, followed 
by CFI, N_NFI, and some others. As regards sensitivity to sample 
size, GFI and AGFI, two very commonly used indices, did not 
perform well, since sample size conditions accounted for more than 
30% of their variation. 

It is interesting to note that, in Tanaka's typology of fit 
indices (Tanaka, 1993), GFI, AGFI, and CFI were classified as sample 
size dependent, while DELTA2 was classified as sample size 
independent. Though this classification was empirically supported 
for GFI and AGFI, empirical results in the present study 
contradicted Tanaka's as regards both CFI and DELTA2: for CFI, 
less than one tenth of a percent of variation was attributed to 
sample size conditions, while more than ten percent of variation 
was attributed to sample size conditions for DELTA2. 

If we desire indices which are not overly influenced by 
estimation methods (ML or GLS in the present study), the GFI and 
AGFI indices seem to be the primary candidates, since the 
proportion of their total variation attributable to estimation 
method was appreciably less than that of other indices. This result, 
however, is based on estimation-appropriate GFI and AGFI indices, 
since different weight matrices are used in ML and GLS estimation 
to construct GFI and AGFI (SAS Institute, 1990; Tanaka, 1993). The 
centrality index trails GFI and AGFI in this regard. Other indices 
had considerably larger proportions of their variation associated 
with estimation method. 
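
To show what an estimation-appropriate GFI means in practice, the following Python sketch uses the weighted form commonly given in the literature: GFI = 1 - tr[(W⁻¹(S - Σ))²] / tr[(W⁻¹S)²], where S is the sample covariance matrix and Σ is the model-implied covariance matrix. The weight matrix W is taken as Σ under ML and as S under GLS. This form and the function names are assumptions for illustration. The exact computation in the software used for this study was not verified.

```python
import numpy as np


def gfi(S, Sigma, W):
    """Weighted GFI: 1 - tr[(W^-1 (S - Sigma))^2] / tr[(W^-1 S)^2]."""
    S = np.asarray(S, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    W = np.asarray(W, dtype=float)
    resid = np.linalg.solve(W, S - Sigma)   # W^-1 (S - Sigma)
    scaled = np.linalg.solve(W, S)          # W^-1 S
    return 1.0 - np.trace(resid @ resid) / np.trace(scaled @ scaled)


def gfi_ml(S, Sigma):
    """GFI with the ML weight matrix (model-implied covariance matrix)."""
    return gfi(S, Sigma, W=Sigma)


def gfi_gls(S, Sigma):
    """GFI with the GLS weight matrix (sample covariance matrix)."""
    return gfi(S, Sigma, W=S)


# Hypothetical 2 x 2 matrices: the model understates the first variance.
S_demo = np.diag([2.0, 1.0])
Sigma_demo = np.eye(2)
print(gfi_ml(S_demo, Sigma_demo), gfi_gls(S_demo, Sigma_demo))
```

In this toy case the same residual yields a GFI of .800 with the ML weight matrix and .875 with the GLS weight matrix. The two versions of the index are therefore not on an identical footing even before any difference in parameter estimates is considered.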

Downward bias occurred for almost all the fit indices 
examined, and such downward bias was more severe under smaller 
sample size conditions. For example, for our Model 1 (true model), 
under sample size 100, the 90% confidence interval (not presented 
in tables) for GFI under ML estimation would be (.9056, .9454), 
while the population GFI was 1.0000. Although such downward bias 
is expected under the true model due to the ceiling effect of fit 
index values, similar downward bias also existed for misspecified 
models, as can be seen from Table 3. Other fit indices exhibited 
similar downward biases in varying degrees. The existence of such 
downward bias suggests that sample fit indices tend to present a 
somewhat more pessimistic picture of model fit than is true in the 
population, especially when sample size is small. Among the fit 
indices examined here, GFI and AGFI had the most serious downward 
bias under smaller sample sizes. 

Limitations 

Several limitations of the present study should be 
acknowledged, since they may limit the generalizability of this 
study. The first limitation of the present study was that only one 
model was used as the basis for model specification condition, 
instead of a range of models varying in characteristics such as 
model complexity and different patterns of coefficient values in 
the model. Since only one model was used in the simulation, it is 
unknown to what extent the results can be replicated for other 
models, or for SEM analysis in general. The contributions of the 
present research must be augmented by further research. 

The second limitation involves the precision of the estimates 
in the study. In the present study, after a sample was generated, 
the sample was fit to one model under one estimation method only, 
and samples in each cell were independent. For example, the 200 
samples in the cell of sample size of 100, ML estimation, and true 
model specification were different from the other 200 samples in 
the cell of sample size of 100, ML estimation, and slight model 
misspecification. The difference between these two cell conditions 
might be due to both model specification and sampling error. 
Although such confounding of model specification and sampling error 
may not be a statistical problem in the long run, it may affect the 
precision of study results if the number of samples in each cell is 
not large enough. To avoid such potential confounding, one sample 
could be generated and fit to all three models, instead of three 
independent samples being generated (Gerbing & Anderson, 1993). 
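
The difference between the two designs can be sketched as a fitting plan. The Python code below enumerates every model fit in the design and assigns each fit a sample identifier, either one fresh sample per fit (as in the present study) or one sample shared by the three models within a sample size, estimation method, and replication. The five sample sizes are inferred from values mentioned in the text. The function and constant names are hypothetical, and no SEM fitting is performed.

```python
from itertools import product

MODELS = ("true", "slightly misspecified", "moderately misspecified")
METHODS = ("ML", "GLS")
SAMPLE_SIZES = (50, 100, 200, 500, 1000)  # inferred from the text
REPLICATIONS = 200


def fit_plan(share_samples_across_models):
    """Return a list of (sample_id, n, method, model), one entry per model fit."""
    plan = []
    next_id = 0
    for n, method, _rep in product(SAMPLE_SIZES, METHODS, range(REPLICATIONS)):
        if share_samples_across_models:
            # One sample is generated and fit to all three models.
            for model in MODELS:
                plan.append((next_id, n, method, model))
            next_id += 1
        else:
            # A fresh, independent sample is generated for every fit.
            for model in MODELS:
                plan.append((next_id, n, method, model))
                next_id += 1
    return plan


independent = fit_plan(share_samples_across_models=False)
shared = fit_plan(share_samples_across_models=True)
print(len(independent), len({row[0] for row in independent}))  # 6000 6000
print(len(shared), len({row[0] for row in shared}))            # 6000 2000
```

Both plans contain 6,000 model fits. Under the shared plan, differences between models within a replication cannot be attributed to different samples, which is the confounding described above.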

SUMMARY 

A balanced 3 x 2 x 5 (three model specification conditions x two 
estimation methods x five sample size conditions) design was used 
in a Monte Carlo simulation study to investigate the effects of 
these factors on SEM fit indices, with 200 replications within each 
cell. A total of 6,000 samples were generated from a prespecified 
population covariance matrix, and each of three prespecified models 
with known specification error was fit to data. Eight popular SEM 
fit indices were studied. The results of the present study suggest 
the following: 

1. Although fit indices seem to be comparable in providing 
information about model fit for correctly specified models, 
some fit indices appear to be non-comparable for incorrectly 
specified models. Some fit indices seem to be much more 
sensitive to model misspecification than others, at least for 
the model conditions investigated in our study. This problem 
has not been well documented in the literature, probably 
because previous studies focused more on correctly specified 
models, and not enough emphasis has previously been put on 
misspecified models. 

2. Estimation method (maximum likelihood versus generalized least 
squares in this study) strongly influenced almost all the fit 
indices examined. This influence does not seem to be obvious 
for correctly specified or slightly misspecified models; for 
more severe model misspecification, however, the effect 
appears to be strong. Again, this phenomenon has not 
previously been well documented in the literature. We suspect 
that the focus of previous studies on correctly specified 
models, rather than on misspecified models in SEM research, may 
have camouflaged this potential difference. 

3. To recommend some fit indices at the expense of others is 
always difficult, since it is never certain if one particular 
study, or even a group of studies, has really captured the 
complexity of model fit within SEM analyses. Nevertheless, 
with this caveat in mind, based on the somewhat limited 
results of the study, we tentatively recommend use of 
McDonald's Centrality, followed by GFI and AGFI indices, 
mainly for their sensitivity to model misspecification. Other 
indices seem to have too little variance under different model 
specification conditions. 

Obviously, more research is needed to address the important 
issues raised in the present study. We suggest that future 
research should not only examine a wider range of models in terms 
of model complexity and some other characteristics, but also study 
a wider range of model specification conditions, including both 
the true model and models misspecified to varying degrees. 

Table 1: Population Parameters for the True Model 

Table 2: Population Covariance Matrix for Generating Samples 

Table 4: Eta-Squares due to Different Sources for the Eight Fit Indices 

Figure 2: Simulated SEM Model: True Model and Two Misspecified Models 